---
title: Web Crawler
sidebarTitle: Web Crawler
---

In this section, we present how to use a web crawler within MindsDB.

A web crawler is a computer program or automated script that browses the internet and navigates through websites, web pages, and web content to gather data. Within the realm of MindsDB, a web crawler can be employed to harvest data, which can be used to train models, 
domain specific chatbots or fine-tune LLMs.

## Prerequisites

Before proceeding, ensure the following prerequisites are met:

1. Install MindsDB [locally via Docker](https://docs.mindsdb.com/setup/self-hosted/docker) or use [MindsDB Cloud](https://cloud.mindsdb.com/).
2. To connect Web Crawler to MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).
3. Install or ensure access to Web Crawler.

## Connection

This handler does not require any connection parameters.

Here is how to initialize a web crawler:

```sql
CREATE DATABASE my_web 
WITH ENGINE = 'web';
```

## Usage

### Get Websites Content

Here is how to get the content of `docs.mindsdb.com`:

```sql
SELECT * 
FROM my_web.crawler 
WHERE url = 'docs.mindsdb.com' 
LIMIT 1;
```

You can also get the content of internal pages. Here is how to fetch the content from 10 internal pages:

```sql
SELECT * 
FROM my_web.crawler 
WHERE url = 'docs.mindsdb.com' 
LIMIT 10;
```

Another option is to get the content from multiple websites.

```sql
SELECT * 
FROM my_web.crawler 
WHERE url IN ('docs.mindsdb.com', 'docs.python.org') 
LIMIT 1;
```

### Get PDF Content

MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can utilize the web crawler to fetch data from `pdf` files.

```sql
SELECT * 
FROM my_web.crawler 
WHERE url = '<link-to-pdf-file>' 
LIMIT 1;
```

For example, you can provide a link to a `pdf` file stored in Amazon S3.
